Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[batch] fix g2 machine memory calculation #14498

Merged
merged 13 commits into from
May 3, 2024
Merged

Conversation

sjparsa
Copy link
Collaborator

@sjparsa sjparsa commented Apr 23, 2024

The memory that machines use is calculated based off of the number of cores the machine has, with a fix ratio set of cores to Gb of RAM. The code assumed a ratio of 3.75 cores: 1 Gb but this assumption does not hold outside of the n1 machine family. This PR changes the functions that calculate memory from number of cores to account for this by passing the machine_type as the main parameter rather than the worker type. Then, logic is added to use the appropriate ratio of 4:1 for the g2 family of machines.

@daniel-goldstein daniel-goldstein changed the title Gpu fix [batch] fix g2 machine memory calculation Apr 26, 2024
Copy link
Contributor

@daniel-goldstein daniel-goldstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving my comments from our discussion yesterday. This looks great and I'm really happy we caught this bug. Just a few comments on the organization.

It would also be great to have some unit testing around the problematic function: gcp_worker_memory_per_core_mib. Take a look at batch/test/test_utils.py, it contains a couple unit tests that just test the behavior of a couple of individual functions. You can run that test file with pytest batch/test/test_utils.py. This would be a good place to add tests for gcp_worker_memory_per_core_mib. We should cover at least a handful of n1s and g2s and some invalid / not-yet-supported machine types that should raise some sort of error.

batch/batch/cloud/gcp/resource_utils.py Outdated Show resolved Hide resolved
batch/batch/cloud/azure/instance_config.py Outdated Show resolved Hide resolved
batch/batch/cloud/gcp/instance_config.py Outdated Show resolved Hide resolved
batch/batch/cloud/gcp/resource_utils.py Outdated Show resolved Hide resolved
batch/batch/cloud/resource_utils.py Outdated Show resolved Hide resolved
batch/batch/front_end/front_end.py Outdated Show resolved Hide resolved
Copy link
Contributor

@daniel-goldstein daniel-goldstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really great! I think there's one spot that was missed w.r.t. hard-coding n1s and then a little more cleanup, but we're getting really close.

batch/batch/cloud/gcp/instance_config.py Outdated Show resolved Hide resolved
batch/batch/inst_coll_config.py Outdated Show resolved Hide resolved
Sophie Parsa added 2 commits May 2, 2024 21:16
Copy link
Contributor

@daniel-goldstein daniel-goldstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks for all your work fixing this sneaky bug!

@hail-ci-robot hail-ci-robot merged commit c093074 into hail-is:main May 3, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants